Toward Computer Vision
Today we move from processing simple structured data with basic linear layers to handling high-dimensional image data. A color image introduces significant complexity that standard architectures cannot handle efficiently. Deep learning for vision requires a specialized approach: the convolutional neural network (CNN).
1. Why Fully Connected Networks (FCNs) Break Down
In a fully connected network, every input pixel must connect to every neuron in the following layer. For high-resolution images, the computation explodes, making training impractical and causing poor generalization through severe overfitting.
- Input Dimension: A standard $224 \times 224$ RGB image results in $150,528$ input features ($224 \times 224 \times 3$).
- Hidden Layer Size: Suppose the first hidden layer uses 1,024 neurons.
- Total Parameters (Layer 1): $\approx 154$ million weights ($150,528 \times 1024$) just for the first connection block, requiring massive memory and compute time.
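The arithmetic behind the list above can be checked in a few lines. This sketch reproduces the example's numbers; the 1,024-neuron layer width is the assumption stated in the text.

```python
# Parameter count for the first fully connected layer on a 224x224 RGB image.
height, width, channels = 224, 224, 3
inputs = height * width * channels      # 150,528 input features
hidden = 1024                           # first hidden layer width (example's assumption)

weights = inputs * hidden               # one weight per (input, neuron) pair
biases = hidden                         # one bias per neuron

print(f"input features : {inputs:,}")   # 150,528
print(f"layer-1 weights: {weights:,}")  # 154,140,672 (~154 million)
```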
The CNN Solution
CNNs solve the FCN scaling problem by exploiting the spatial structure of images. They use small filters to recognize patterns (such as edges or curves), reducing the parameter count by several orders of magnitude and improving model robustness.
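To make "orders of magnitude" concrete, compare a convolutional layer against the fully connected layer computed earlier. A conv layer's parameter count depends only on filter size, input channels, and filter count, not on image resolution. The 64-filter, 3×3 configuration below is an illustrative assumption, not a figure from the text.

```python
# Convolutional layer: 64 filters of size 3x3 over 3 input channels.
k, in_ch, filters = 3, 3, 64
conv_params = k * k * in_ch * filters + filters     # weights + biases = 1,792

# Fully connected layer from the FCN example above.
fc_params = (224 * 224 * 3) * 1024 + 1024           # 154,141,696

print(f"conv layer: {conv_params:,} parameters")
print(f"fc layer  : {fc_params:,} parameters")
print(f"reduction : ~{fc_params // conv_params:,}x fewer")
```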
Question 1
What is the primary benefit of using Local Receptive Fields in CNNs?
Question 2
If a $3 \times 3$ filter is applied across an entire image, what core CNN concept is being utilized?
Question 3
Which CNN component is responsible for progressively reducing the spatial dimensions (width and height) of the feature maps?
Challenge: Identifying Key CNN Components
Relate CNN mechanisms to their functional benefits.
We need to build a vision model that is highly parameter efficient and can recognize an object even if it slightly shifts its position in the image.
Step 1
Which mechanism ensures the network can identify a feature (like a diagonal line) regardless of where it is in the frame?
Solution:
Shared Weights. By using the same filter across all locations, the network learns translation invariance.
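Weight sharing can be demonstrated in one dimension: the same filter slides over the whole signal, so a pattern is detected wherever it occurs and only the output position moves. The 3-tap edge filter below is an illustrative choice.

```python
# Pure-Python 1D cross-correlation: the SAME kernel is applied at every position.
def correlate(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge = [-1, 0, 1]                # responds to a rising step
a = [0, 0, 0, 1, 1, 1, 0, 0]     # step at index 3
b = [0, 0, 0, 0, 0, 1, 1, 1]     # same step, shifted right by 2

ra = correlate(a, edge)
rb = correlate(b, edge)
print(ra.index(max(ra)), rb.index(max(rb)))  # prints "1 3": the detection
# peak shifts by exactly 2, matching the input shift -- same filter, any location.
```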
Step 2
What architectural choice allows a CNN to detect features with fewer parameters than an FCN?
Solution:
Local Receptive Fields (or Sparse Connectivity). Instead of connecting to every pixel, each neuron only connects to a small, localized region of the input.
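Sparse connectivity can be quantified per neuron: with a local 3×3 receptive field over 3 channels (an illustrative choice), each unit touches 27 inputs instead of all 150,528.

```python
# Fan-in per neuron: dense connection vs. a local receptive field.
dense_fan_in = 224 * 224 * 3   # every pixel, every channel
local_fan_in = 3 * 3 * 3       # one 3x3 patch across 3 channels

print(dense_fan_in, local_fan_in)                     # 150528 vs 27
print(f"~{dense_fan_in // local_fan_in:,}x sparser")  # ~5,575x fewer connections
```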
Step 3
How does the CNN structure lead to hierarchical feature learning (e.g., edges $\to$ corners $\to$ objects)?
Solution:
Stacked Layers. Early layers learn simple features (edges) using convolution. Deeper layers combine the outputs of earlier layers to form complex, abstract features (objects).
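The hierarchy arises because each stacked layer widens the input region a unit can "see". A minimal sketch, assuming stride-1, undilated convolutions, where each k×k layer grows the effective receptive field by k − 1:

```python
# Effective receptive field (in pixels, one dimension) of stacked conv layers.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1          # each stride-1 kxk conv adds (k - 1)
    return rf

for depth in (1, 2, 3):
    print(depth, "x 3x3 conv ->", receptive_field([3] * depth), "px")
# 1 layer sees 3 px (edges), 2 layers see 5 px (corners),
# 3 layers see 7 px -- deeper units combine earlier features into larger patterns.
```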